[Draft] 355_benchmark #570

wenxie-amd wants to merge 5 commits into main from dev/355_benchmark

Conversation

@wenxie-amd (Contributor)

No description provided.

@wenxie-amd wenxie-amd requested review from Copilot and removed request for Copilot February 27, 2026 12:46
Copilot AI review requested due to automatic review settings February 27, 2026 13:58
Copilot AI (Contributor) left a comment

Pull request overview

Adds benchmark/training convenience scripts and configuration tweaks for MI355X DeepSeek pretraining runs, plus enhanced training-log memory reporting and some workflow/build adjustments.

Changes:

  • Add new launcher scripts for DeepSeek V3 and DeepSeek V2 Lite training, and a C4 data preparation helper.
  • Extend Megatron training-log patching to report the rank with max ROCm memory usage.
  • Update Slurm/docker launch behavior and Megatron prepare/build flags; adjust DeepSeek V3 MI355X config defaults; comment out CI unit-test jobs.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.

Summary per file:

  • start_training_dsv3.sh: New MI355X DeepSeek V3 Slurm launch script with env defaults and run flags.
  • start_training_dsv2_lite.sh: New MI355X DeepSeek V2 Lite Slurm launch script with env defaults and run flags.
  • prepare_c4_data.sh: New helper to merge pre-downloaded C4 shards and tokenize for Megatron.
  • primus/backends/megatron/patches/training_log/print_rank_last_patches.py: Add distributed aggregation to log the max ROCm memory usage rank.
  • examples/run_slurm_pretrain.sh: Allow overriding srun time/partition via env.
  • examples/run_pretrain.sh: Update AINIC NCCL IB TC/FIFO TC env settings.
  • examples/run_local_pretrain.sh: Prefer docker over podman and pull image before running.
  • examples/megatron/prepare.py: Install Emerging-Optimizers with --no-build-isolation.
  • examples/megatron/configs/MI355X/deepseek_v3-BF16-pretrain.yaml: Change PP default and add Turbo/Deepep-related overrides.
  • .github/workflows/ci.yaml: Comment out torch/jax unit-test jobs.

Comment on lines +167 to +168
docker_podman_proxy pull "$DOCKER_IMAGE"

Copilot AI Feb 27, 2026

Unconditionally pulling the image before running means this script will now fail in offline/air-gapped environments (or when using a locally-built image tag) even if the image is already present locally. Consider making the pull optional (e.g., behind a PULL_DOCKER_IMAGE flag) or using a runtime option like "--pull=missing" where supported.
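One way to make the pull opt-in, sketched under assumptions: the PULL_DOCKER_IMAGE flag name is hypothetical (not in the current script), and docker_podman_proxy is the wrapper the script already uses.

```shell
#!/usr/bin/env bash
# Hypothetical PULL_DOCKER_IMAGE policy: 0 = never pull, 1 = always pull,
# "missing" (default) = pull only when the image is absent locally.
DOCKER_IMAGE="${DOCKER_IMAGE:-rocm/primus:latest}"
PULL_DOCKER_IMAGE="${PULL_DOCKER_IMAGE:-missing}"

maybe_pull_image() {
    case "$PULL_DOCKER_IMAGE" in
        0)
            echo "skip pull"
            ;;
        1)
            docker_podman_proxy pull "$DOCKER_IMAGE"
            ;;
        missing)
            # Only reach for the registry when the image is not cached locally.
            if docker_podman_proxy image inspect "$DOCKER_IMAGE" >/dev/null 2>&1; then
                echo "image present, skip pull"
            else
                docker_podman_proxy pull "$DOCKER_IMAGE"
            fi
            ;;
    esac
}
```

Offline and air-gapped hosts can then set PULL_DOCKER_IMAGE=0 and keep using a locally built tag.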

export NCCL_IB_HCA="ionic_0:1,ionic_2:1,ionic_3:1,ionic_4:1,ionic_5:1,ionic_7:1,ionic_8:1,ionic_9:1"
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
export CLEAN_DOCKER_CONTAINER=1

Copilot AI Feb 27, 2026

CLEAN_DOCKER_CONTAINER=1 will cause examples/run_local_pretrain.sh to remove all containers on the host (it runs "docker/podman ps -aq" and rm -f each). That’s a risky default for shared nodes; consider defaulting this to 0 and only enabling it explicitly when you’re sure it’s safe.

Suggested change
export CLEAN_DOCKER_CONTAINER=1
# Set to 1 to allow run_slurm_pretrain.sh to clean up all Docker/Podman containers on the host.
# Use 1 only on dedicated/non-shared nodes where this is safe.
export CLEAN_DOCKER_CONTAINER=0

export PRIMUS_USER=tas
# export PRIMUS_TOKENIZED_DATA_PATH=/shared_aig/c4/tokenized/c4_en_train_text_document # this is the tokenized data path for the training
export PRIMUS_EXP_NAME=dsv3-pretrain-mbs_$MBS-gbs_$GBS-PP_$PRIMUS_PP-EP_$PRIMUS_EP-VPP_$PRIMUS_VPP-turbodeepep_$TURBO_DEEPEEP-legacygg_$LEGACY_GG-profile_$PROFILE
export PRIMUS_EXP_NAME=debugv3

Copilot AI Feb 27, 2026

PRIMUS_EXP_NAME is assigned twice; the second assignment overrides the descriptive name built from the run parameters, so logs/output will always go to the hard-coded "debugv3" directory. Remove the second export or gate it behind an explicit debug flag to avoid accidental output collisions.

Suggested change
export PRIMUS_EXP_NAME=debugv3
: "${PRIMUS_DEBUG_EXP_NAME:=0}"
if [ "$PRIMUS_DEBUG_EXP_NAME" = "1" ]; then
export PRIMUS_EXP_NAME=debugv3
fi

Comment on lines +16 to +29
# By default downloads 1 shard (~350MB compressed, ~3M documents) for testing.
# Full C4-en has 1024 shards. Adjust --num_shards for more data.
###############################################################################

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
TOKENIZER_TYPE="DeepSeekV3Tokenizer"
TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"
WORKERS=${WORKERS:-$(nproc)} # Number of preprocessing workers
HF_TOKEN=${HF_TOKEN:-"your_hf_token"} # Set your HuggingFace token

Copilot AI Feb 27, 2026

The header comment says the script "downloads" C4 shards and that the default is 1 shard, but the implementation explicitly skips downloading and defaults NUM_SHARDS to 200. Please align the documentation and defaults with the actual behavior (either implement download, or update comments and set NUM_SHARDS default accordingly).

Comment on lines +86 to +92
for i in $(seq 0 $((NUM_SHARDS - 1))); do
    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
    echo " [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
    zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
done

Copilot AI Feb 27, 2026

This merge step appends directly into the final JSONL_FILE with '>>'. If the script is interrupted or a shard is corrupt, you'll end up with a partial JSONL that then causes future runs to skip merging because the file exists. Write to a temporary file and atomically move it into place on success (and/or validate the output) to avoid leaving a bad cached artifact.

Suggested change
for i in $(seq 0 $((NUM_SHARDS - 1))); do
    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
    echo " [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
    zcat "${SHARD_PATH}" >> "${JSONL_FILE}"
done
# Write to a temporary file first to avoid leaving a corrupted final JSONL
TMP_JSONL_FILE="$(mktemp "${JSONL_DIR}/c4_en_train.jsonl.tmp.XXXXXX")"
# Ensure the temporary file is cleaned up on failure or interruption
cleanup_tmp() {
    if [ -n "${TMP_JSONL_FILE:-}" ] && [ -f "${TMP_JSONL_FILE}" ]; then
        rm -f "${TMP_JSONL_FILE}"
    fi
}
trap cleanup_tmp EXIT INT TERM
for i in $(seq 0 $((NUM_SHARDS - 1))); do
    SHARD_NAME=$(printf "c4-train.%05d-of-01024.json.gz" "$i")
    SHARD_PATH="${RAW_DIR}/${SHARD_NAME}"
    echo " [${i}/${NUM_SHARDS}] Decompressing ${SHARD_NAME} ..."
    zcat "${SHARD_PATH}" >> "${TMP_JSONL_FILE}"
done
# Basic validation: ensure the merged file is non-empty before finalizing
if [ ! -s "${TMP_JSONL_FILE}" ]; then
    echo "ERROR: Merged JSONL is empty; aborting."
    cleanup_tmp
    exit 1
fi
# Move the completed temp file into place atomically
mv "${TMP_JSONL_FILE}" "${JSONL_FILE}"
# Prevent trap from deleting the now-final JSONL file
TMP_JSONL_FILE=""
trap - EXIT INT TERM

export NCCL_IB_HCA="ionic_0:1,ionic_2:1,ionic_3:1,ionic_4:1,ionic_5:1,ionic_7:1,ionic_8:1,ionic_9:1"
export GLOO_SOCKET_IFNAME=ens9np0
export NCCL_SOCKET_IFNAME=ens9np0
export CLEAN_DOCKER_CONTAINER=1

Copilot AI Feb 27, 2026

CLEAN_DOCKER_CONTAINER=1 will cause examples/run_local_pretrain.sh to remove all containers on the host (it runs "docker/podman ps -aq" and rm -f each). That’s a risky default for shared nodes; consider defaulting this to 0 and only enabling it explicitly when you’re sure it’s safe.

Suggested change
export CLEAN_DOCKER_CONTAINER=1
export CLEAN_DOCKER_CONTAINER=0 # Set to 1 to forcibly remove all Docker/Podman containers; use with caution on shared hosts

r_ratio = r_used / r_total

# get the max rocm_mem_usage
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)

Copilot AI Feb 27, 2026

usage_tensor is created as float32 even though r_used is a byte count on the order of 1e11 for large GPUs. float32 will lose integer precision at that scale, which can make max-rank selection and reported GB values slightly inaccurate. Prefer using an integer dtype (e.g., int64) for the gathered byte counts, then format as GB on CPU.

Suggested change
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.int64)

Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key

Copilot AI Feb 27, 2026

This script exports placeholder values for HF_TOKEN and WANDB_API_KEY, which will overwrite any real credentials already set in the environment and can cause training/CI runs to fail unless the file is edited. Prefer reading from the environment and erroring if unset (or only setting defaults when variables are empty), rather than hard-coding placeholder strings.

Suggested change
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
# Require HF_TOKEN and WANDB_API_KEY to be provided via the environment.
if [ -z "${HF_TOKEN:-}" ]; then
    echo "Error: HF_TOKEN environment variable is not set. Please export HF_TOKEN before running this script." >&2
    exit 1
fi
if [ -z "${WANDB_API_KEY:-}" ]; then
    echo "Error: WANDB_API_KEY environment variable is not set. Please export WANDB_API_KEY before running this script." >&2
    exit 1
fi
export HF_TOKEN
export WANDB_API_KEY

Comment on lines +23 to +25
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}

Copilot AI Feb 27, 2026

PRIMUS_PATH defaults to a user-specific absolute path (/shared/john/Primus), which makes this script non-portable and likely to fail for other users/environments. Consider requiring PRIMUS_PATH to be provided (and exit with a clear message if unset) or deriving it relative to the repo root.

Suggested change
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"/shared/john/Primus"}
SCRIPT_DIR="$(cd -- "$(dirname "${BASH_SOURCE[0]}")" >/dev/null 2>&1 && pwd)"
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
DATA_DIR=${DATA_DIR:-"/shared/c4"}
PRIMUS_PATH=${PRIMUS_PATH:-"${SCRIPT_DIR}/../Primus"}
if [[ ! -d "$PRIMUS_PATH" ]]; then
    echo "Error: PRIMUS_PATH is not set to a valid directory: '$PRIMUS_PATH'" >&2
    echo "Please set PRIMUS_PATH explicitly, for example:" >&2
    echo "  export PRIMUS_PATH=/path/to/Primus" >&2
    exit 1
fi

Comment on lines +178 to +185
# run-unittest-torch:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-torch-j8knc]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env

Copilot AI Feb 27, 2026

The unit test jobs for both torch and jax are commented out, which removes automated test execution from CI and can allow regressions (including in this PR) to merge unnoticed. If this is temporary, consider gating them behind a workflow input/label, reducing the test scope, or moving them to a separate workflow instead of disabling them entirely.
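If the disablement is temporary, the jobs can be gated behind an opt-in PR label instead of being commented out; a minimal sketch, where the run-heavy-tests label name is illustrative:

```yaml
run-unittest-torch:
  # Only run when the PR carries the opt-in label, instead of deleting the job.
  if: contains(github.event.pull_request.labels.*.name, 'run-heavy-tests')
  needs: [code-lint]
  runs-on: [primus-lm-cicd-torch-j8knc]
```

This keeps the job definition under version control and lets maintainers re-enable it per PR without editing the workflow.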

@AMD-AGI AMD-AGI deleted a comment from Copilot AI Feb 28, 2026
@AMD-AGI AMD-AGI deleted a comment from Copilot AI Feb 28, 2026
@AMD-AGI AMD-AGI deleted a comment from Copilot AI Feb 28, 2026
@AMD-AGI AMD-AGI deleted a comment from Copilot AI Feb 28, 2026
Copilot AI review requested due to automatic review settings March 3, 2026 03:03

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (2)

prepare_c4_data.sh:30

  • HF_TOKEN is defined here but is never used by this script (Step 1 explicitly skips download, and Step 2 calls preprocess_data.py without passing auth). Either remove this variable or use it in the download/auth flow to avoid implying it has an effect.
TOKENIZER_MODEL="deepseek-ai/DeepSeek-V3"
WORKERS=${WORKERS:-$(nproc)}          # Number of preprocessing workers
HF_TOKEN=${HF_TOKEN:-"your_hf_token"}             # Set your HuggingFace token

prepare_c4_data.sh:139

  • The summary message references run_dsv3.sh, but this repo’s training entrypoints appear to be start_training_dsv3.sh / examples/run_slurm_pretrain.sh. Consider updating the wording to point to the actual script(s) users should run to avoid confusion.
echo "To use this data for training, set in run_dsv3.sh:"
echo ""
echo "  1. Change:  --mock_data True  →  --mock_data False"
echo "  2. Add env:  export PRIMUS_TOKENIZED_DATA_PATH=${TRAIN_OUTPUT_PREFIX}_text_document"

Comment on lines +46 to +48
--time="${SLURM_TIME:-07:00:00}" \
--nodelist="${SLURM_NODELIST:-}" \
--partition="${SLURM_PARTITION:-amd-aig}" \

Copilot AI Mar 3, 2026

Passing --nodelist with an empty value (when SLURM_NODELIST is unset) causes srun to fail with an invalid nodelist. Consider only adding --nodelist when SLURM_NODELIST is non-empty (or default to omitting the flag entirely).
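A sketch of the conditional flag construction, using a bash array so the flag is omitted entirely when SLURM_NODELIST is empty (the build_srun_args helper name is illustrative):

```shell
#!/usr/bin/env bash
# Build srun arguments; add --nodelist only when SLURM_NODELIST is non-empty,
# since srun rejects an empty --nodelist= value.
build_srun_args() {
    SRUN_ARGS=(
        --time="${SLURM_TIME:-07:00:00}"
        --partition="${SLURM_PARTITION:-amd-aig}"
    )
    if [ -n "${SLURM_NODELIST:-}" ]; then
        SRUN_ARGS+=(--nodelist="$SLURM_NODELIST")
    fi
}
```

The launch line then becomes srun "${SRUN_ARGS[@]}" ..., and no empty nodelist ever reaches srun.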

ARGS=("$@")

VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")

Copilot AI Mar 3, 2026

This hard-codes an extra bind mount to /shared_aig/c4, which will break (or unintentionally create an empty host directory) on systems that don’t have that path. Consider making this conditional/configurable via an env var (e.g., EXTRA_VOLUME_ARGS / PRIMUS_EXTRA_MOUNTS) rather than always mounting a cluster-specific path.

Suggested change
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH" -v "/shared_aig/c4:/shared_aig/c4")
VOLUME_ARGS=(-v "$PRIMUS_PATH":"$PRIMUS_PATH" -v "$DATA_PATH":"$DATA_PATH")
# Optional extra volume mounts: set PRIMUS_EXTRA_MOUNTS to a string like:
#   '-v /shared_aig/c4:/shared_aig/c4 -v /other/path:/other/path:ro'
if [[ -n "${PRIMUS_EXTRA_MOUNTS:-}" ]]; then
    # Intentional word splitting to allow multiple -v arguments.
    VOLUME_ARGS+=(${PRIMUS_EXTRA_MOUNTS})
elif [[ -d "/shared_aig/c4" ]]; then
    # Backwards-compatible default: only mount /shared_aig/c4 if it exists.
    VOLUME_ARGS+=(-v "/shared_aig/c4:/shared_aig/c4")
fi

set -e

# ======================== Configuration ========================
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)

Copilot AI Mar 3, 2026

The header comment says the default downloads 1 shard for testing, but NUM_SHARDS is set to default to 200. Please update either the comment or the default so they match (and avoid surprising users by downloading/processing 200 shards by default).

Suggested change
NUM_SHARDS=${NUM_SHARDS:-200} # Number of C4 shards to download (1-1024)
NUM_SHARDS=${NUM_SHARDS:-1} # Number of C4 shards to download (1-1024)

Comment on lines +267 to 271
ret = subprocess.run(
    ["pip", "install", "--no-build-isolation", "-e", str(emerging_optimizers_path)], check=True
)
if ret.returncode != 0:
    log_error_and_exit("Building Emerging Optimizers failed.")

Copilot AI Mar 3, 2026

subprocess.run(..., check=True) will raise an exception on failure, so the subsequent if ret.returncode != 0: branch is redundant/unreachable. Either remove the returncode check, or set check=False and keep the explicit returncode handling (and ideally invoke pip via sys.executable -m pip to ensure the active Python environment is used).
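A hedged sketch of the check=False shape, with a generic run_checked helper (the helper and message names are illustrative, not the repo's API):

```python
import subprocess
import sys


def run_checked(cmd, fail_msg):
    """Run a command and abort with a clean message on failure.

    check=False keeps subprocess.run from raising CalledProcessError, so the
    explicit returncode branch below is actually reachable.
    """
    ret = subprocess.run(cmd, check=False)
    if ret.returncode != 0:
        # Mirrors a log_error_and_exit-style abort: message, no traceback.
        raise SystemExit(fail_msg)


# Invoking pip via the running interpreter targets the active environment
# (path variable assumed from the surrounding script):
# run_checked(
#     [sys.executable, "-m", "pip", "install", "--no-build-isolation",
#      "-e", str(emerging_optimizers_path)],
#     "Building Emerging Optimizers failed.",
# )
```

Alternatively, keep check=True and drop the returncode branch, wrapping the call in try/except subprocess.CalledProcessError.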

Comment on lines +178 to +204
# run-unittest-torch:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-torch-j8knc]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
#     - name: Checkout Repo Primus-Turbo
#       uses: actions/checkout@v4
#       with:
#         repository: AMD-AIG-AIMA/Primus-Turbo
#         submodules: "recursive"
#         path: Primus-Turbo
#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
#     - run: echo "Begin Primus-Turbo Install."
#     - name: Install Primus-Turbo
#       run: |
#         mv Primus-Turbo /tmp/
#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
#         git config --global --add safe.directory /tmp/Primus-Turbo
#         cd /tmp/Primus-Turbo
#         start_time=$(date +%s)
#         echo "✅ [Pip install requirements] started at: $(date)"
#         mkdir -p ${PRIMUS_WORKDIR}/primus-cache
#         MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt

Copilot AI Mar 3, 2026

The Torch/JAX unit test jobs are entirely commented out, which effectively disables CI test execution for PRs and can allow regressions to merge undetected. If this is only needed temporarily, consider gating the jobs behind an if: condition (e.g., on a label/branch) rather than removing them from the workflow, or re-enable them before merging.

Suggested change
# run-unittest-torch:
#   env:
#     PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
#   needs: [code-lint]
#   runs-on: [primus-lm-cicd-torch-j8knc]
#   steps:
#     - run: echo "🎉 Begin Primus-Turbo Checkout."
#     - name: Set commit hash to env
#       run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
#     - name: Checkout Repo Primus-Turbo
#       uses: actions/checkout@v4
#       with:
#         repository: AMD-AIG-AIMA/Primus-Turbo
#         submodules: "recursive"
#         path: Primus-Turbo
#         ref: ${{ env.PRIMUS_TURBO_COMMIT }}
#     - run: echo "Begin Primus-Turbo Install."
#     - name: Install Primus-Turbo
#       run: |
#         mv Primus-Turbo /tmp/
#         echo "Primus-Turbo dir: /tmp/Primus-Turbo"
#         git config --global --add safe.directory /tmp/Primus-Turbo
#         cd /tmp/Primus-Turbo
#         start_time=$(date +%s)
#         echo "✅ [Pip install requirements] started at: $(date)"
#         mkdir -p ${PRIMUS_WORKDIR}/primus-cache
#         MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt
run-unittest-torch:
  env:
    PRIMUS_WORKDIR: /wekafs/primus-data/primus_safe_ci/torch
  needs: [code-lint]
  runs-on: [primus-lm-cicd-torch-j8knc]
  steps:
    - run: echo "🎉 Begin Primus-Turbo Checkout."
    - name: Set commit hash to env
      run: echo "PRIMUS_TURBO_COMMIT=${PRIMUS_TURBO_COMMIT}" >> $GITHUB_ENV
    - name: Checkout Repo Primus-Turbo
      uses: actions/checkout@v4
      with:
        repository: AMD-AIG-AIMA/Primus-Turbo
        submodules: "recursive"
        path: Primus-Turbo
        ref: ${{ env.PRIMUS_TURBO_COMMIT }}
    - run: echo "Begin Primus-Turbo Install."
    - name: Install Primus-Turbo
      run: |
        mv Primus-Turbo /tmp/
        echo "Primus-Turbo dir: /tmp/Primus-Turbo"
        git config --global --add safe.directory /tmp/Primus-Turbo
        cd /tmp/Primus-Turbo
        start_time=$(date +%s)
        echo "✅ [Pip install requirements] started at: $(date)"
        mkdir -p ${PRIMUS_WORKDIR}/primus-cache
        MAX_JOBS=128 pip install --cache-dir=${PRIMUS_WORKDIR}/primus-cache --no-build-isolation --no-clean -r requirements.txt

Comment on lines +3 to +4
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key

Copilot AI Mar 3, 2026

This script exports HF_TOKEN and WANDB_API_KEY as literal strings, which both encourages storing secrets in a committed file and also overwrites any values already present in the caller’s environment. Prefer reading these from the environment (and erroring if missing) rather than exporting placeholder values here.

Suggested change
export HF_TOKEN="your_hf_token" # make it your own hf token
export WANDB_API_KEY="your_wandb_api_key" # make it your own wandb api key
: "${HF_TOKEN:?Environment variable HF_TOKEN must be set}"
: "${WANDB_API_KEY:?Environment variable WANDB_API_KEY must be set}"
export HF_TOKEN
export WANDB_API_KEY

Comment on lines +197 to +215
# get the max rocm_mem_usage
usage_tensor = torch.tensor([r_used], device="cuda", dtype=torch.float32)
world_size = torch.distributed.get_world_size()
gathered_usage = [torch.zeros_like(usage_tensor) for _ in range(world_size)]
torch.distributed.all_gather(gathered_usage, usage_tensor)

rocm_mem_usages = [t.item() for t in gathered_usage]
max_usage = max(rocm_mem_usages)
max_rank = rocm_mem_usages.index(max_usage)

rocm_mem_str = (
    f" | rocm mem usage/free/total/usage_ratio: "
    f"{r_used / 1024 ** 3:.2f}GB/"
    f"{r_free / 1024 ** 3:.2f}GB/"
    f"{r_total / 1024 ** 3:.2f}GB/"
    f"{r_ratio * 100:.2f}%"
    f" | rank-{max_rank} rocm max mem usage/usage_ratio: "
    f"{max_usage / 1024 ** 3:.2f}GB/"
    f"{max_usage / r_total * 100:.2f}%"

Copilot AI Mar 3, 2026

max_usage / r_total uses the local rank’s r_total when computing the max-usage ratio, which can be incorrect if ranks have different total HBM sizes (or if totals differ for any reason). To make this accurate, gather each rank’s r_total (or gather precomputed r_used/r_total ratios) and compute the max ratio corresponding to max_rank.
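Once both per-rank values are gathered (e.g. as a 2-element int64 tensor per rank via torch.distributed.all_gather), the selection itself is plain Python; a sketch of that final step, with a hypothetical helper name:

```python
def max_usage_and_ratio(used_bytes, total_bytes):
    """Pick the rank with the highest memory usage and compute its ratio
    against that rank's own total, not the local rank's total.

    used_bytes / total_bytes: per-rank byte counts, index == rank.
    Returns (max_rank, max_used, max_ratio).
    """
    max_rank = max(range(len(used_bytes)), key=used_bytes.__getitem__)
    max_used = used_bytes[max_rank]
    return max_rank, max_used, max_used / total_bytes[max_rank]
```

With heterogeneous HBM sizes (say rank 1 has 200 GiB total while the local rank has 100 GiB), dividing by the local r_total would double the reported ratio for rank 1; indexing total_bytes by max_rank avoids that.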

# parallel
tensor_model_parallel_size: ${PRIMUS_TP:1}
pipeline_model_parallel_size: ${PRIMUS_PP:1}
pipeline_model_parallel_size: ${PRIMUS_PP:8}

Copilot AI Mar 3, 2026

This config changes the default pipeline_model_parallel_size from the common default of 1 to 8, which is inconsistent with the MI355X DeepSeek-V3 FP8 config and most other configs. To avoid surprising users (and breaking single-node runs), consider keeping the default at 1 and relying on PRIMUS_PP to opt into PP>1.

Suggested change
pipeline_model_parallel_size: ${PRIMUS_PP:8}
pipeline_model_parallel_size: ${PRIMUS_PP:1}
